Audio augmentation for speech recognition
نویسندگان
چکیده
Data augmentation is a common strategy adopted to increase the quantity of training data, avoid overfitting and improve robustness of the models. In this paper, we investigate audio-level speech augmentation methods which directly process the raw signal. The method we particularly recommend is to change the speed of the audio signal, producing 3 versions of the original signal with speed factors of 0.9, 1.0 and 1.1. The proposed technique has a low implementation cost, making it easy to adopt. We present results on 4 different LVCSR tasks with training data ranging from 100 hours to 960 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios. An average relative improvement of 4.3% was observed across the 4 tasks.
منابع مشابه
Deep Beamforming and Data Augmentation for Robust Speech Recognition: Results of the 4th CHiME Challenge
Robust automatic speech recognition in adverse environments is a challenging task. We address the 4 CHiME challenge [1] multi-channel tracks by proposing a deep eigenvector beamformer as front-end. To train the acoustic models, we propose to supplement the beamformed data by the noisy audio streams of the individual microphones provided in the real set. Furthermore, we perform data augmentation...
متن کاملExploring Data Augmentation for Improved Singing Voice Detection with Neural Networks
In computer vision, state-of-the-art object recognition systems rely on label-preserving image transformations such as scaling and rotation to augment the training datasets. The additional training examples help the system to learn invariances that are difficult to build into the model, and improve generalization to unseen data. To the best of our knowledge, this approach has not been systemati...
متن کاملTwo-Stage Data Augmentation for Low-Resourced Speech Recognition
Low resourced languages suffer from limited training data and resources. Data augmentation is a common approach to increasing the amount of training data. Additional data is synthesized by manipulating the original data with a variety of methods. Unlike most previous work that focuses on a single technique, we combine multiple, complementary augmentation approaches. The first stage adds noise a...
متن کاملAcoustic scene classification using convolutional neural network and multiple-width frequency-delta data augmentation
In recent years, neural network approaches have shown superior performance to conventional hand-made features in numerous application areas. In particular, convolutional neural networks (ConvNets) exploit spatially local correlations across input data to improve the performance of audio processing tasks, such as speech recognition, musical chord recognition, and onset detection. Here we apply C...
متن کاملVideo Augmentation for Improving Audio Speech Recognition under Noise
For the recognition of speech, in particular spoken digits, captured in video with poor sound due to noise, we develop a novel audio-visual fusion technique that performs significantly better than utilising either audio or video signal alone. Specifically, we present an audio-visual intermediate fusion strategy to locate speaker dependant pronounced digits in continuous video recorded with soun...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2015